ABSTRACT

An acoustic confidence measure for acceptance/rejection of
recognition hypotheses for continuous speech utterances is
proposed. This measure is useful for rejecting utterances that
are out of domain, or contain out-of-vocabulary words or speech
disfluencies. A phone-based approach is implemented so that a
single global threshold can be applied to hypothesis rejection
for any word sequence. Phone confidence is computed for each
frame of speech as the posterior phone probability given the
acoustic observation. Word sequence confidence is evaluated as
the average phone confidence, either by weighting all frames
equally or by normalizing by phone duration. The confidence
measure is tested on a database of spoken company names. When
normalized by phone duration, it achieves, in some cases with
less computational expense, rejection performance comparable to
a baseline system implementing a common filler-model approach.
When all frames are equally weighted, performance is
substantially poorer.

1.  INTRODUCTION 

When continuous speech recognition systems are fielded to a
large community of users, especially infrequent users (e.g.,
callers of a telephone information service), it is common that
many spoken inputs do not fall within the domain that the
recognition system is designed to handle. This may be due to
speech disfluencies on the part of the user (e.g., hesitations,
word fragments, corrections), an incomplete language model
(i.e., the system does not model all the word strings users
say), or a poor understanding of the domain by the user (i.e.,
the user does not understand the range of inputs allowable at
that point in the interaction). The ability to reject
out-of-domain utterances is essential for the design of
user-friendly interfaces.

A number of approaches have been suggested in the past for
rejecting putative hits in keyword spotting (e.g., [1, 2, 3, 4,
5, 6, 7]), for detecting out-of-vocabulary words (e.g., [8]),
and for utterance rejection (e.g., [8, 9]). Some of these
systems use a filler model to match non-keyword speech. A
typical filler model is a set of context-independent phonetic
models. Also, some systems use anti-keyword models. For example,
in the digit recognition system in [9], for each digit, an
anti-digit model was trained on all digits except the target
digit. A central issue in all these approaches is the
normalization of acoustic likelihood scores of recognition
hypotheses.

We propose a phone-based confidence measure for rejecting
recognition hypotheses. A recognition hypothesis (for an uttered
word sequence) is rejected if its overall confidence score falls
below a threshold. Two variations of the phone-based confidence
measure are compared. Although we demonstrate here the
application of our rejection strategy to a system without
keyword spotting ability (i.e., the case when the only
acceptable inputs are in-domain spoken word sequences
unaccompanied by extraneous speech), the same strategy can be
used for rejecting putative keyword hits while wordspotting.

Section 2 describes the confidence measure, Section 3 describes
experimental results, and Section 4 presents conclusions and
future directions.

2.  PHONE-BASED  CONFIDENCE  MEASURE 

Let $PH = \{PH_1, PH_2, \ldots, PH_N\}$ be a Viterbi-decoded
sequence of phones for a spoken utterance. Let
$O = \{O_1, O_2, \ldots, O_T\}$ be the acoustic observation
sequence for the utterance. Equivalently,

$O = \{O_{b[1]}, \ldots, O_{e[1]}, O_{b[2]}, \ldots, O_{e[2]}, \ldots, O_{b[N]}, \ldots, O_{e[N]}\},$

where $b[i]$ and $e[i]$ denote, respectively, the beginning and
ending frames of the $i$th phone. Note that $b[1] = 1$ and
$e[N] = T$. Although our recognition system uses
context-dependent phones, context-independent phones are used
(for implementation reasons) to calculate the acoustic
confidence measure (ACM), for which there are two variations,
$ACM_1$ and $ACM_2$. $ACM_1$ (Equation 1) is the average
per-frame log phone posterior probability. $ACM_2$ (Equation 2)
is the average duration-normalized log phone posterior
probability. The important distinction is that $ACM_1$ weights
all frames equally in their contribution to the overall
confidence, whereas $ACM_2$ weights all phones equally.
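
Written out from the verbal definitions above, Equations 1 and 2
take the following form. This is a reconstruction from the prose,
not a copy of the typeset equations; it assumes that
$P(PH_i \mid O_t)$ denotes the posterior probability (Equation 3)
of the $i$th decoded phone given frame $t$.

```latex
\begin{align}
ACM_1 &= \frac{1}{T} \sum_{i=1}^{N} \sum_{t=b[i]}^{e[i]}
         \log P(PH_i \mid O_t) \tag{1} \\
ACM_2 &= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{e[i] - b[i] + 1}
         \sum_{t=b[i]}^{e[i]} \log P(PH_i \mid O_t) \tag{2}
\end{align}
```

Here $e[i] - b[i] + 1$ is the duration in frames of the $i$th
phone, so each phone contributes its mean log posterior to
$ACM_2$ regardless of its length, whereas in $ACM_1$ a phone's
contribution grows with its duration.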

Equation 3 defines the posterior phone probability for $O_t$.
In Equation 3, the local acoustic observation likelihood for a
given phone, $p(O_t \mid PH_j)$, is computed as the maximum of
the likelihood scores of the acoustic observation over all 3
states of the context-independent phone hidden Markov model
(HMM). The denominator of Equation 3 is a sum over all
context-independent phone HMMs in a phonetically tied mixture
system (a system in which only HMM states that belong to
allophones of the same phone share the same mixture components).
Note that for a phonetically tied mixture system, the
denominator is exactly $p(O_t)$, the unconditional likelihood of
the acoustic observation, considering all context-dependent
phone models in the system. This is not the case for a general
genonic tied mixture system [10].
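
The per-frame posterior described above can be sketched in NumPy
as follows. The sketch assumes that Equation 3 has the Bayes-rule
form $P(PH_j \mid O_t) = p(O_t \mid PH_j) P(PH_j) / \sum_k p(O_t \mid PH_k) P(PH_k)$,
which is inferred from the prose. The function name
`frame_log_posteriors`, the array layout, and the toy numbers are
hypothetical; in a real system the state likelihoods would come
from the context-independent phone HMMs.

```python
import numpy as np

def frame_log_posteriors(state_loglik, log_prior):
    """Per-frame log phone posteriors (assumed form of Equation 3).

    state_loglik: array (T, P, S), log p(O_t | state s of phone j)
    log_prior:    array (P,), log a priori phone probabilities
    returns:      array (T, P), log P(PH_j | O_t)
    """
    # Phone likelihood is the maximum over the phone's HMM states.
    phone_loglik = state_loglik.max(axis=2)
    # Numerator in the log domain: log p(O_t | PH_j) + log P(PH_j).
    joint = phone_loglik + log_prior[None, :]
    # Denominator: sum over all phones, via a stable log-sum-exp.
    m = joint.max(axis=1, keepdims=True)
    log_evidence = m + np.log(np.exp(joint - m).sum(axis=1, keepdims=True))
    return joint - log_evidence

# Toy input: 2 frames, 3 phones, 3 states per phone (made-up values).
rng = np.random.default_rng(0)
state_loglik = rng.normal(-5.0, 1.0, size=(2, 3, 3))
log_prior = np.log(np.array([0.5, 0.3, 0.2]))
log_post = frame_log_posteriors(state_loglik, log_prior)
```

Working in the log domain avoids underflow when the likelihoods
are very small, and the resulting posteriors sum to one over
phones at every frame.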
3.  EXPERIMENTAL  RESULTS

We conducted experiments using a database of company names
spoken over the telephone to an HMM-based phonetically tied
mixture continuous speech recognition system. The a priori
context-independent phone probabilities, needed for Equation 3,
were available from an independent database. The recognition
task is to recognize which of 12,000 company names was spoken.
The test set contains 916 utterances. Approximately one third of
these (296 utterances) are not valid for the application,
containing either no company name in the utterance, a company
name not handled by the system, an acceptable company name with
extraneous speech, or a subtle variant of an acceptable company
name. In the test set, there are many tokens of such subtle
variants as well as many in-domain company names that differ by
only one or two phones. This makes the task rather challenging.
We refer to these 296 utterances as out-of-domain utterances.
The goal is to maximize rejection, or equivalently, to minimize
false acceptance of these out-of-domain utterances. All other
utterances are considered in domain, and should not be rejected.

Figure 1 shows recognition accuracy on the in-domain utterances
as a function of false acceptance rate on the out-of-domain
utterances. The baseline uses a rejection model implemented
previously in a number of other systems (e.g., [3]), based on a
filler model consisting of a set of context-independent phones.
A weight is used to adjust the tradeoff between correct
acceptance (i.e., not rejecting an in-domain utterance) and
correct rejection performance (i.e., rejecting an out-of-domain
utterance). The $ACM_1$ and $ACM_2$ systems use a threshold to
determine when to reject a recognition hypothesis. For computing
the hypothesis confidence via $ACM_1$ and $ACM_2$, regions
recognized as non-speech were ignored.
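
The way a threshold traces out such an operating curve can be
sketched as follows. This is an illustrative sketch, not the
evaluation code behind Figure 1: the function name
`operating_points`, the inputs, and the toy scores are
hypothetical. It also assumes that an in-domain utterance counts
toward recognition accuracy only when it is both accepted and
correctly recognized.

```python
import numpy as np

def operating_points(scores, in_domain, correct, thresholds):
    """Return (false_acceptance_rate, in_domain_accuracy) per threshold.

    A hypothesis is accepted when its confidence score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    in_domain = np.asarray(in_domain, dtype=bool)
    correct = np.asarray(correct, dtype=bool)
    points = []
    for th in thresholds:
        accepted = scores >= th
        # Fraction of out-of-domain utterances that were wrongly accepted.
        far = accepted[~in_domain].mean()
        # Assumed definition: accepted and correct, over all in-domain.
        acc = (accepted & correct & in_domain).sum() / in_domain.sum()
        points.append((float(far), float(acc)))
    return points

# Hypothetical confidence scores (average log posteriors, so <= 0).
scores    = [-0.2, -0.5, -1.5, -0.4, -2.0, -0.9]
in_domain = [True, True, True, False, False, False]
correct   = [True, True, False, False, False, False]
points = operating_points(scores, in_domain, correct,
                          thresholds=[-3.0, -1.0, -0.3])
```

Raising the threshold lowers the false acceptance rate but also
rejects more correctly recognized in-domain utterances, which is
the tradeoff plotted in the figure.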

Figure 1. Rejection Method Performance Comparison

It is clear from the figure that the confidence measure that
weights all frames equally ($ACM_1$) performs significantly
worse than the one that weights all phones equally ($ACM_2$) for
all false acceptance rates. One possible reason for this is the
following. When the recognized word sequence shares many phones
with the correct word sequence, but has several extra phones,
the corresponding phone HMMs for these extra phones must be
traversed across an acoustic observation region which
corresponds to uttered phones that are different from those
recognized. Typically, in order to get the best recognition
match, these phones will have minimal duration in the Viterbi
backtrace. In our system, the minimal duration is 3 frames for
our 3-state phone models. Furthermore, since these recognized
phones are incorrect, they typically have very poor likelihood
scores. In these cases, a confidence measure that is more
sensitive to these 3 frames of very poor likelihood scores would
be able to identify the misrecognition and reject the
hypothesis. Since $ACM_2$ weights phones equally, these very
poor likelihood scores have more weight. In contrast, for
$ACM_1$, since these phones have a minimal Viterbi duration (3
frames in our system), they have less weight. We have started to
investigate this theory, and there is anecdotal evidence that it
is valid.
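
A small numerical illustration of this argument, using made-up
log posteriors rather than data from the experiments: a
hypothesis with four 20-frame phones that match well (log
posterior -0.1 per frame) and one extra minimal-duration 3-frame
phone that matches poorly (-6.0 per frame). The helper names
`acm1` and `acm2` are hypothetical and follow the verbal
definitions in Section 2.

```python
import numpy as np

def acm1(log_post, segments):
    """Average log posterior over all frames (frames weighted equally)."""
    total = sum(log_post[b:e + 1].sum() for b, e in segments)
    n_frames = sum(e - b + 1 for b, e in segments)
    return float(total / n_frames)

def acm2(log_post, segments):
    """Average of per-phone mean log posteriors (phones weighted equally)."""
    return float(np.mean([log_post[b:e + 1].mean() for b, e in segments]))

# log_post[t] is the log posterior of the hypothesized phone at frame t.
durations = [20, 20, 3, 20, 20]
values = [-0.1, -0.1, -6.0, -0.1, -0.1]
log_post = np.concatenate([np.full(d, v) for d, v in zip(durations, values)])

# Inclusive (begin, end) frame indices per phone (0-based b[i], e[i]).
ends = np.cumsum(durations) - 1
segments = [(int(e) - d + 1, int(e)) for d, e in zip(durations, ends)]

score1 = acm1(log_post, segments)
score2 = acm2(log_post, segments)
print(round(score1, 3), round(score2, 3))
```

With these numbers the frame-weighted score is about -0.31 and
the phone-weighted score is about -1.28, against -0.1 for both
when the poorly matching phone is absent. The short bad phone
therefore moves the phone-weighted score much further, making it
easier to separate from good hypotheses with a single threshold.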

$ACM_2$ achieved the same recognition accuracy as the baseline
filler model for false acceptance rates greater than 22 percent.
In some cases, the $ACM_2$ approach may be less expensive to
implement than the filler-model approach. At lower false
acceptance rates, the baseline model outperformed both $ACM_1$
and $ACM_2$, although the recognition accuracy for all methods
was quite poor in this region.

4.  CONCLUSIONS  AND  FUTURE  DIRECTIONS

An acoustic confidence measure (ACM) for word-string hypotheses
is proposed. The hypothesis confidence is evaluated as the
average phone confidence. We experimented with two variations of
the acoustic confidence measure, one that weights all frames
equally ($ACM_1$), and one that weights all phones equally by
normalizing for phone duration ($ACM_2$).

$ACM_2$ provided performance comparable to our baseline system,
which uses a set of context-independent phones as a filler
model. The $ACM_2$ scheme may be less expensive to implement
than the filler-model approach in some cases. $ACM_1$ provided
significantly worse performance. There is anecdotal evidence
that this poorer performance is due to the lower sensitivity of
$ACM_1$ to outlying, very poor likelihood scores that occur in
minimal-duration phones. Such phones are typically indicative of
a misrecognition in which the hypothesis shares many phones
with, but has several more phones than, the correct word
sequence.

One issue that we wish to address next is normalizing the phone
confidences with respect to phone model performance [11]. Also,
from our comparison of $ACM_1$ and $ACM_2$, it seems it would be
advantageous to incorporate durational information in confidence
scoring (i.e., rather than just normalizing for duration). In
addition, we wish to use context-dependent phone models to
evaluate confidence measures in order to improve the estimation
of posterior phone probabilities. Finally, we would like to
apply our approach to a keyword spotting system in which we
would compute word-level confidences as an average of the phone
confidences for the phones making up the word. For this task,
the confidence measure presented here could be used, and, with
proper normalization [11], a single threshold could accommodate
all keywords, eliminating the problem of determining thresholds
for keywords that are uncommon.